(54) Method for multistage speech recognition using confidence measures

(57) In a method for recognizing speech commands, in which a group of
command words selectable by speech commands is defined, a time window
is defined, within which the recognition of the speech command is
performed. In a first recognition stage in the method, the recognition
result of the first recognition stage is selected, for which a first
confidence value is determined. Further in the method, a first
threshold value (Y) is determined, with which said first confidence
value is compared. If said first confidence value is greater than or
equal to said first threshold value (Y), the recognition result of the
first recognition stage is selected as the recognition result of the
speech command. If said first confidence value is smaller than said
first threshold value (Y), a second recognition stage is performed for
the speech command, wherein said time window is extended, and a
recognition result is selected for the second recognition stage. A
second confidence value is determined for the recognition result of the
second recognition stage and compared with said first threshold value
(Y). If said second confidence value is greater than or equal to said
first threshold value (Y), the command word selected at the second
stage is selected as the recognition result for the speech command. If
said second confidence value is smaller than said first threshold value
(Y), a comparison stage is performed, wherein the first and second
recognition results are compared to find out with what probability they
are substantially the same, wherein if the probability exceeds a
predetermined value, the command word selected at the second stage is
selected as the recognition result for the speech command.

Description

[0001] The present invention relates to a method in the recognition of
speech as set forth in the preamble of the appended claim 1, a speech
recognition device as set forth in the preamble of the appended claim
7, and a wireless communication device to be controlled by speech, as
set forth in the preamble of the appended claim 9.

[0002] To facilitate the use of wireless communication devices,
so-called hands-free devices have been developed, whereby the wireless
communication device can be controlled by speech. Thus, speech can be
used to control different functions of the wireless communication
device, such as turning on/off, transmission/reception, control of
sound volume, selection of a telephone number, or answering a call,
whereby, particularly when the device is used in a vehicle, it is
easier for the user to concentrate on driving.

[0003] One drawback of a wireless communication device controlled by
speech is that speech recognition is not fully faultless. In a car, the
background noise caused by the environment has a high volume, which
makes it difficult to recognize speech. Due to the unreliability of
speech recognition, users of wireless communication devices have so far
shown relatively little interest in control by speech. The recognition
capability of present speech recognizers is not particularly good,
especially under difficult conditions, such as in a moving car, where
the high volume of background noise substantially hampers reliable
recognition of words. Incorrect recognition decisions usually cause
most problems in the implementation of the user interface, because
incorrect recognition decisions may start undesired functions, such as
terminating a call during a call, which is naturally particularly
disturbing to the user. One result of an incorrect recognition decision
may be that a call is connected to an incorrect number. For this
reason, the user interface is designed in such a way that the user is
usually asked to repeat a command if the speech recognizer does not
have sufficient certainty of a word uttered by the user.

[0004] Almost all speech recognition devices are based on the
functional principle that a word uttered by the user is compared, by a
usually rather complicated method, with a group of reference words
previously stored in the memory of the speech recognition device.
Speech recognition devices usually calculate a figure for each
reference word to describe how much the word uttered by the user
resembles the reference word. The recognition decision is finally made
on the basis of these figures so that the decision is to select the
reference word which the uttered word resembles most. The best known
methods in the comparison between the uttered word and the reference
words are dynamic time warping (DTW) and the statistical hidden Markov
model (HMM).
[0005] In both the DTW and the HMM methods, an unknown speech pattern
is compared with known reference patterns. In dynamic time warping, the
speech pattern is divided into several frames, and the local distance
between the speech pattern included in each frame and the corresponding
speech segment of the reference pattern is calculated. This distance is
calculated by comparing the speech segment and the corresponding speech
segment of the reference pattern with each other, and it is thus a kind
of numerical value for the differences found in the comparison. For
speech segments close to each other, a smaller distance is usually
obtained than for speech segments further from each other. On the basis
of the local distances obtained this way, a minimum path between the
beginning and end points of the word is sought by using a DTW
algorithm. Thus, by dynamic time warping, a distance is obtained
between the uttered word and the reference word. In the HMM method,
speech patterns are produced, and this stage of speech pattern
generation is modelled with a state change model according to the
Markov method. The state change model in question is thus the HMM. In
this case, speech recognition on received speech patterns is performed
by defining an observation probability on the speech patterns according
to the hidden Markov model. In speech recognition by using the HMM
method, an HMM model is first formed for each word to be recognized,
i.e. for each reference word. These HMM models are stored in the memory
of the speech recognition device. When the speech recognition device
receives the speech pattern, an observation probability is calculated
for each HMM model in the memory, and as the recognition result, a
counterpart word is obtained for the HMM model with the greatest
observation probability. Thus, for each reference word, the probability
is calculated that it is the word uttered by the user. The
above-mentioned greatest observation probability describes the
resemblance of the received speech pattern and the closest HMM model,
i.e. the closest reference speech pattern.
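
The HMM-based selection described above can be illustrated with a small sketch. It assumes discrete observation symbols and uses the standard forward algorithm to obtain the observation probability of each word model; the toy models for "yes" and "no", the symbol alphabet {0, 1} and all function names are hypothetical and are not taken from the description.

```python
import math

def forward_log_prob(observations, start, trans, emit):
    """Log of the observation probability P(observations | model) for a
    discrete HMM, computed with the (scaled) forward algorithm."""
    if not observations:
        raise ValueError("at least one observation is required")
    n = len(start)
    # alpha[s]: probability of the observations so far, ending in state s
    alpha = [start[s] * emit[s][observations[0]] for s in range(n)]
    log_prob = 0.0
    for obs in observations[1:]:
        scale = sum(alpha)
        if scale == 0.0:
            return -math.inf
        log_prob += math.log(scale)          # keep the scale in log form
        alpha = [x / scale for x in alpha]   # rescale to avoid underflow
        alpha = [
            sum(alpha[p] * trans[p][s] for p in range(n)) * emit[s][obs]
            for s in range(n)
        ]
    total = sum(alpha)
    return log_prob + math.log(total) if total > 0.0 else -math.inf

def recognize(observations, models):
    """Return the reference word whose model gives the greatest
    observation probability, together with all log scores."""
    scores = {w: forward_log_prob(observations, *m) for w, m in models.items()}
    return max(scores, key=scores.get), scores

# Hypothetical two-state left-to-right models: (start, transitions, emissions)
TRANS = [[0.6, 0.4], [0.0, 1.0]]
MODELS = {
    "yes": ([1.0, 0.0], TRANS, [[0.9, 0.1], [0.1, 0.9]]),
    "no":  ([1.0, 0.0], TRANS, [[0.1, 0.9], [0.9, 0.1]]),
}

best_word, scores = recognize([0, 0, 1, 1], MODELS)
```

A real recognizer would use continuous feature vectors and trained models; the sketch only shows the principle of computing one observation probability per reference word and selecting the greatest.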

[0006] Thus, in present systems the speech recognition device
calculates a certain figure for the reference words on the basis of the
word uttered by the user. In the DTW method, the figure is the distance
between the words, and in the HMM method, the figure is the probability
for the equality of the uttered word and the HMM model. When the HMM
method is used, the speech recognition devices are usually set a
certain threshold probability which the most probable reference word
must achieve to make the recognition decision. Another factor affecting
the recognition decision can be e.g. the difference between the
probabilities of the most probable and the second most probable word,
which must be sufficiently great to make the recognition decision.
Thus, it is possible that when the background noise has a high volume,
on the basis of a command uttered by the user, a reference word in the
memory, e.g. the reference word "yes", obtains at each attempt the
greatest probability in relation to the other reference words, e.g. the
probability 0.8. If the threshold probability is for example 0.9, the
recognition is not accepted and the user may have to utter the command
several times until the recognition probability threshold is exceeded
and the speech recognition device accepts the command, even though the
probability may have been very close to the acceptable value. This is
very disturbing to the user.
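
The prior-art decision rule described in this paragraph can be sketched as follows. The function name, the default margin and the example probabilities for the word "no" are assumptions made for illustration; only the values 0.8 and 0.9 come from the text.

```python
def accept_recognition(probabilities, threshold=0.9, min_margin=0.0):
    """Return the most probable reference word if it reaches the threshold
    probability and leads the runner-up by at least min_margin; otherwise
    return None, meaning the user is asked to repeat the command."""
    ranked = sorted(probabilities.items(), key=lambda kv: kv[1], reverse=True)
    best_word, best_p = ranked[0]
    second_p = ranked[1][1] if len(ranked) > 1 else 0.0
    if best_p >= threshold and best_p - second_p >= min_margin:
        return best_word
    return None

# "yes" is clearly the best candidate at 0.8, yet it is rejected at threshold 0.9.
result = accept_recognition({"yes": 0.8, "no": 0.1}, threshold=0.9)
```

With these numbers `result` is `None` even though "yes" is far ahead of "no", which is the frustrating behaviour the paragraph describes.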
[0007] Furthermore, speech recognition is hampered by the fact that
different users utter the same words in different ways, wherein the
speech recognition device works better when used by one user than when
used by another user. In practice, it is very difficult with the
presently known techniques to adjust the certainty levels of speech
recognition devices to consider all users. When adjusting the required
certainty level e.g. for the word "yes" in speech recognition devices
of prior art, the required threshold is typically set according to
so-called worst speakers. Thus, the problem emerges that words close to
the word "yes" also become incorrectly accepted. The problem is
aggravated by the fact that in some situations, mere background noise
may be recognized as command words. In speech recognition devices of
prior art, the aim is to find a suitable balance in which a certain
part of the users have great problems in having their words accepted
and the number of incorrectly accepted words is sufficiently small. If
the speech recognition device is adjusted in a way that a minimum
number of users have problems in having their words accepted, this
means in practice that the number of incorrectly accepted words will
increase. Correspondingly, if the aim is set at as faultless a
recognition as possible, an increasing number of users will have
difficulties in having commands uttered by them accepted.
[0008] In speech recognition, errors are generally classified in three
categories:

Insertion Error

The user says nothing but a command word is recognized in spite of
this, or the user says a word which is not a command word and still a
command word is recognized.

Deletion Error

The user says a command word but nothing is recognized.

Substitution Error

The command word uttered by the user is recognized as another command
word.
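
As an illustration only, the three error categories can be expressed as a small classification function. The function name, the use of `None` for "nothing said" or "nothing recognized", and the label "correct rejection" for the case in which nothing is recognized and no command word was uttered are assumptions not found in the text.

```python
from typing import Optional, Set

def classify_outcome(uttered: Optional[str],
                     recognized: Optional[str],
                     vocabulary: Set[str]) -> str:
    """Classify one recognition attempt.

    uttered:    what the user actually said (None if nothing was said)
    recognized: command word output by the recognizer (None if nothing)
    """
    said_command = uttered in vocabulary  # False for None or non-command words
    if recognized is None:
        return "deletion" if said_command else "correct rejection"
    if not said_command:
        return "insertion"
    return "correct" if recognized == uttered else "substitution"

VOCAB = {"yes", "no"}
outcome = classify_outcome("hello", "yes", VOCAB)  # a non-command word is accepted
```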

[0009] In a theoretical optimum solution, the speech recognition device
makes none of the above-mentioned errors. However, in practical
situations, as was already presented above, the speech recognition
device makes errors of all the said types. For usability of the user
interface, it is important to design the speech recognition device in a
way that the relative shares of the different error types are optimal.
For example in speech activation, where a speech-activated device waits
even for hours for a certain activation word, it is important that the
device is not erroneously activated at random. Furthermore, it is
important that the command words uttered by the user are recognized
with good accuracy. In this case, however, it is more important that no
erroneous activations take place. In practice, this means that the user
must repeat the uttered command word more often so that it would be
recognized correctly with a sufficient probability.

[0010] In the recognition of a numerical sequence, almost all errors
are equally significant. Any error in the recognition of the numbers in
a sequence results in a false numerical sequence. Also the situation
that the user says nothing and still a number is recognized is
inconvenient for the user. However, a situation in which the user
utters a number indistinctly and the number is not recognized can be
corrected by the user by uttering the numbers more distinctly.

[0011] The recognition of a single command word is presently a very
typical function implemented by speech recognition. For example, the
speech recognition device may ask the user: "Do you want to receive a
call?", to which the user is expected to reply either "yes" or "no". In
such situations where there are very few alternative command words, the
command words are often recognized correctly, if at all. In other
words, the number of substitution errors in such a situation is very
small. The greatest problem in the recognition of single command words
is that an uttered command is not recognized at all, or an irrelevant
word is recognized as a command word. In the following, there are three
different alternative situations of this example:

1) A speech-controlled device asks the user: "Do you want to receive a
call?", to which the user replies indistinctly: "Yes ... ye-". The
device does not recognize the user's reply and asks the user again:
"Do you want to receive a call? Say yes or no." Thus the user may
be easily frustrated, if the device often asks the user to repeat
the command word uttered.

2) The device asks the user again: "Do you want to receive a call?", to
which the user responds distinctly "yes". However, the device did
not recognize this for certain and wants a confirmation: "Did you
say yes?", to which the user replies again "yes". Even now, no
reliable recognition was made, so the device asks again: "Did you
say yes?". The user must repeat again the reply "yes", for the
device to complete the recognition.

3) Still in a third example situation, the speech-controlled device
asks the user if s/he wants to receive a call. To this, the user
mumbles something vague, and in spite of this, the device
interprets the user's utterance as the command word "yes" and
informs the user "All right, the call will be connected". Thus, in
this situation, the interpretation of the device of the user's
vague speech is closer to the word "yes" than to the word "no".
Consequently, in this situation, words that resemble a command word
begin to be incorrectly accepted.

[0012] In speech recognition methods according to prior art, it is
typical to use in the recognition of the command word a time window of
fixed length, during which the user must utter the command word. In
another speech recognition method of prior art, the recognition
probability is calculated for the command word uttered by the user, and
if this probability does not exceed a predetermined threshold value,
the user is requested to utter the command word again, after which a
new calculation of the recognition probability is performed by
utilizing the probability calculated in the previous recognition time.
The recognition decision is made if the threshold probability is
achieved considering the previous probabilities. In this method,
however, the utilization of repetition will easily result in an
increase in the chance of the above-mentioned insertion error, wherein
upon repeating a word outside the vocabulary, it is more easily
recognized as a command word.
[0013] It is an aim of the present invention to provide an improved
speech recognition method as well as a speech-controlled wireless
communication device in which speech recognition is made more reliable
compared with prior art. The invention is based on the idea that the
recognition probability calculated for an uttered command word is
compared with the probability of background noise, wherein the
confidence value thus obtained is used to deduce whether the
recognition was positive. If the confidence value remains below a
determined threshold for a positive recognition, the time window used
in the recognition is extended and a new recognition is performed for
the repeated utterance of the command word. If the repeated command
word is not recognized with a sufficient confidence value, a comparison
between the command words uttered by the user is still performed; thus,
in case the recognitions of the words uttered by the user indicate that
the user has uttered the same command word two times in succession, the
recognition is accepted. The method according to the present invention
is primarily characterized in what will be presented in the
characterizing part of the appended claim 1. The speech recognition
device according to the present invention is primarily characterized in
what will be presented in the characterizing part of the appended claim
7. Furthermore, the wireless communication device according to the
present invention is characterized in what will be presented in the
characterizing part of the appended claim 9.

[0014] The present invention provides significant advantages over
speech recognition methods and devices of prior art. With the method
according to the invention, a smaller probability of insertion errors
is obtained than is possible to achieve with methods of prior art. In
the method of the invention, when the recognition is not certain, the
time for interpreting the command word is extended, wherein the user
has a possibility to repeat the command word given. According to the
invention, it is additionally possible, if necessary, to effectively
utilize the repetition of the command word by making a comparison with
the command word uttered earlier by the user, which comparison improves
the recognition of the command word significantly. Thus, the number of
incorrect recognitions can be significantly reduced. Also the
probability of such situations in which a command word is recognized
although the user did not utter a command word is reduced
significantly. The method of the invention renders it possible to use
such confidence levels in which the number of incorrectly recognized
command words is minimal. Users who do not have their speech commands
easily accepted in solutions of prior art can, by repetition of the
command word according to the invention, significantly improve the
probability of acceptance of speech commands uttered.
[0015] In the following, the invention will be described in more detail
with reference to the appended drawings, in which

Fig. 1 shows recognition thresholds used in the method according to
       an advantageous embodiment of the invention,

Fig. 2 shows a method according to an advantageous embodiment of the
       invention in a state machine presentation,

Fig. 3 illustrates time warping of feature vectors,

Fig. 4 illustrates the comparison of two words in a histogram, and

Fig. 5 illustrates a wireless communication device according to an
       advantageous embodiment of the invention in a reduced schematic
       diagram.

[0016] In the following, we will describe the function of the speech
recognition device according to an advantageous embodiment of the
invention. In the method, a command word uttered by the user is
recognized by calculating, in a way known as such, the probability for
how close the uttered word is to different command words. On the basis
of these probabilities calculated for the command words, the command
word with the greatest probability is advantageously selected. After
this, the probability calculated for the recognized word is compared
with the probability produced by a background noise model. The
background noise model represents general background noise and also all
such words which are not command words. Thus, a probability is
calculated here for the possibility that the recognized word is only
background noise or a word other than a command word. On the basis of
this comparison, a first confidence value is obtained, to indicate how
positively the word was recognized. Figure 1 illustrates the
determination of the confidence of this recognition by using threshold
values and said confidence value. In the method according to an
advantageous embodiment of the invention, a first threshold value is
determined, indicated in the appended figure with the reference Y. It
is the limit which the confidence value must reach for the recognition
to be positive (the confidence value is greater than or equal to the
first threshold value Y). In the method according to a second
advantageous embodiment of the invention, also a second threshold value
is determined, indicated in the appended figure with the reference A.
This indicates whether the recognition was uncertain (the confidence
value is greater than or equal to the second threshold value A but
smaller than the first threshold value Y) or very uncertain (the
confidence value is smaller than the second threshold value A).
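
The use of the two thresholds can be sketched as below. The description does not state how the comparison with the background noise model is turned into a number, so the log-likelihood ratio used in `confidence_value` is purely an assumption, as are the function names and the example threshold values.

```python
import math

def confidence_value(word_prob: float, background_prob: float) -> float:
    """ASSUMPTION: confidence as the log ratio of the best command-word
    probability to the background noise model probability."""
    return math.log(word_prob) - math.log(background_prob)

def classify_confidence(confidence: float, y: float, a: float) -> str:
    """Map a confidence value onto the three regions of Fig. 1,
    where y is the first threshold (Y) and a the second threshold (A)."""
    if a > y:
        raise ValueError("second threshold A must not exceed first threshold Y")
    if confidence >= y:
        return "certain"
    if confidence >= a:
        return "uncertain"
    return "very uncertain"

# Hypothetical thresholds Y = 1.5 and A = 0.5
label = classify_confidence(confidence_value(0.8, 0.1), y=1.5, a=0.5)
```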
[0017] In the state machine presentation of Fig. 2, state 1 represents
the recognition of a command word. At this stage of recognizing the
command word, probabilities are determined on the basis of the command
word uttered by the user for different command words in the vocabulary
of the speech recognition device. The command word with the greatest
probability is preliminarily selected as the command word corresponding
to the speech command uttered by the user. For the selected command
word, said confidence value is determined and compared with the first
threshold value Y and the second threshold value A to deduce whether
the recognition was certain, uncertain or very uncertain. If the
confidence value is greater than or equal to the first threshold value
Y, the operation moves on to state 4 to accept the recognition. If the
confidence value remained smaller than the second threshold value A,
the operation moves on to state 5 to exit from the command word
recognition, i.e. to reject the recognition. If the confidence value
was greater than or equal to the second threshold value A but smaller
than the first threshold value Y, the recognition was uncertain, and
the operation moves on to state 2. Thus, the time window is extended,
i.e. the user will have more time to say the uttered command word
again. The operation can move on to this state 2 also because of an
incorrect word, e.g. as a result of a word uttered very unclearly by
the user or as a result of incorrect recognition caused by background
noise. In this state 2, the device waits for the repetition of the
command word for the duration of the extended time window. If the user
uttered a command word again in this time window, the command word is
recognized and the confidence value is calculated, as presented above
in connection with the state 1. If at this stage the calculated
confidence value indicates that the command word uttered at this second
stage is recognized with sufficient confidence, the operation moves on
to the state 4 and the recognition is accepted. For example, in a
situation when the user may have said something vague in the state 1
but has uttered the correct command word clearly in the state 2, the
recognition can be made solely on the basis of this command word
uttered in the state 2. Thus, no comparison will be made between the
first and second command words uttered, because this would easily lead
to a less secure recognition decision.

[0018] Nevertheless, if the command word cannot be recognized with
sufficient confidence in the state 2, the operation moves on to state 3
for comparison of the repeated command words. If this comparison
indicates that the command word repeated by the user was very close to
the command word first said by the user, i.e. the same word was
probably uttered twice in succession, the recognition is accepted and
the operation moves on to the state 4. However, if the comparison
indicates that the user has probably not said the same word twice, the
operation moves on to the state 5 and the recognition is rejected.
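
The transitions between states 1 to 5 described above can be summarized in a short sketch. The state names, the function signature and the handling of a missing repetition (treated here as a rejection) are assumptions; the description does not say what happens if the user stays silent during the extended time window.

```python
from enum import Enum
from typing import List, Optional

class State(Enum):
    RECOGNIZE = 1     # state 1: first recognition
    WAIT_REPEAT = 2   # state 2: extended time window, second recognition
    COMPARE = 3       # state 3: comparison of the two utterances
    ACCEPT = 4        # state 4: recognition accepted
    REJECT = 5        # state 5: recognition rejected

def run_recognition(first_conf: float,
                    second_conf: Optional[float],
                    repeats_match: bool,
                    y: float,
                    a: float) -> List[State]:
    """Return the sequence of states visited for one speech command.

    second_conf is None if nothing was uttered in the extended window.
    repeats_match is the outcome of the state 3 comparison.
    """
    trace = [State.RECOGNIZE]
    if first_conf >= y:                  # certain
        return trace + [State.ACCEPT]
    if first_conf < a:                   # very uncertain
        return trace + [State.REJECT]
    trace.append(State.WAIT_REPEAT)      # uncertain: extend the time window
    if second_conf is None:
        # ASSUMPTION: no repetition within the window leads to rejection
        return trace + [State.REJECT]
    if second_conf >= y:                 # second utterance alone suffices
        return trace + [State.ACCEPT]
    trace.append(State.COMPARE)
    trace.append(State.ACCEPT if repeats_match else State.REJECT)
    return trace

# Uncertain first attempt, uncertain repetition, but both utterances match.
visited = [s.value for s in run_recognition(1.0, 1.0, True, y=1.5, a=0.5)]
```

Note that the comparison in state 3 is reached only when the second utterance also fails to reach the first threshold, in line with the remark that a clearly recognized repetition is accepted on its own.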

[0019] Consequently, in the method of the invention, when the first
step indicates an uncertain recognition, a second recognition is made
preferably by a recognition method known as such. If this second step
provides no sufficient certainty of the recognition, a comparison of
the repetitions is made advantageously in the following way. In the
state 1, feature vectors formed of the command word uttered by the user
are stored in a speech response memory 4 (Fig. 5). Such feature vectors
are extracted from the speech typically at intervals of ca. 10 ms, i.e.
ca. 100 feature vectors per second. Also in the state 2, feature
vectors formed of the command word uttered at this stage are stored in
the speech response memory 4. After this, the recognition moves on to
the state 3, in which these feature vectors stored in the memory are
compared preferably by dynamic time warping. Figure 3 illustrates this
dynamic time warping of feature vectors in a simplified manner. At the
top of the figure are shown feature vectors produced by the first
recognition, indicated with the reference number V1, and
correspondingly, at the bottom of the figure are shown feature vectors
produced by the second recognition and indicated with the reference
number V2. In this example, the first word was longer than the second
word, i.e. the user has said the word faster at the second stage, or
the words involved are different. Thus, for the feature vectors of the
shorter word, in this case the second word, one or more corresponding
feature vectors are found from the longer word by time warping the
feature vectors of the two words in a way that they correspond to each
other optimally. In this example, these time warping results are
indicated by broken lines in Fig. 3. The distance between the words is
calculated e.g. as a Euclidean distance between the warped feature
vectors. If the calculated distance is large, it can be assumed that
the words in question are different. Figure 4 shows an example of this
comparison as a histogram. The histogram includes two different
comparisons: the comparison between two identical words (shown in solid
lines) and a comparison between two different words (shown in broken
lines). The horizontal axis is the logarithmic value of the calculated
distance between the feature vectors, wherein a smaller value
represents a smaller distance, and the vertical axis is the histogram
value. Thus, the smaller the distances at which the high histogram
values occur, i.e. the smaller the calculated distance, the higher the
probability that the words to be compared are the same.
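
The comparison of the two stored utterances can be sketched with a textbook dynamic time warping routine. The step pattern, the normalization by the combined sequence length, the decision threshold on the logarithmic distance and the one-dimensional toy feature vectors are all assumptions chosen for illustration; the description only specifies time warping followed by a Euclidean distance.

```python
import math

def dtw_distance(v1, v2):
    """Length-normalized DTW distance between two sequences of feature
    vectors, using the Euclidean distance as the local distance."""
    n, m = len(v1), len(v2)
    if n == 0 or m == 0:
        raise ValueError("both utterances need at least one feature vector")
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(v1[i - 1], v2[j - 1])
            # extend the cheapest of the three admissible predecessor paths
            cost[i][j] = local + min(cost[i - 1][j],
                                     cost[i][j - 1],
                                     cost[i - 1][j - 1])
    return cost[n][m] / (n + m)

def same_word(v1, v2, max_log_distance):
    """Decide that two utterances are the same word if the logarithm of
    their DTW distance does not exceed a predetermined value."""
    d = dtw_distance(v1, v2)
    return d == 0.0 or math.log(d) <= max_log_distance

# Hypothetical one-dimensional feature vectors
first = [[0.0], [1.0], [1.0], [2.0]]   # slower utterance (V1)
second = [[0.0], [1.0], [2.0]]         # faster repetition (V2)
other = [[5.0], [5.0], [5.0]]          # a different word

match = same_word(first, second, max_log_distance=0.0)
```

In this toy case the warping aligns the repeated frame of the longer utterance with a single frame of the shorter one, so the distance between `first` and `second` is zero, while `second` and `other` stay far apart.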

[0020] Figure 5 shows a wireless communication device 1, such as a GSM
mobile phone, controlled by speech commands according to an
advantageous embodiment of the invention. Only the most essential
blocks for understanding the invention are presented in Fig. 5. A
speech control unit 2 comprises preferably a speech recognition means
3, a speech response memory 4, a central processing unit 5, a read-only
memory 6, a random access memory 7, a speech synthesizer 8, and
interface means 9. A command word can be entered e.g. with a microphone
10a in the wireless communication device 1 or with a microphone 10b in
a hands-free facility 17. Instructions and notices to the user can be
produced e.g. with the speech synthesizer 8 either via a speaker 11a
integrated in the wireless communication device 1 or via a speaker 11b
in the hands-free facility 17. The speech control unit 2 according to
the invention can also be implemented without the speech synthesizer 8,
wherein instructions and notices to the user are transmitted preferably
in text format on a display means 13 in the telecommunication device.
Yet another possibility is to transmit the instructions and notices as
both audio and text messages to the user. The speech response memory 4
or a part of the same can be implemented also in connection with the
random access memory 7, or it can be integrated in a possible general
memory space of the wireless communication device.

[0021] In the following, the operation of the wireless communication
device according to the invention will be described further. For the
speech control to function, the speech control unit 2 must normally be
taught all the command words to be used. These have been taught
preferably at the stage of manufacturing the device, e.g. in a way that
patterns corresponding to the command words are stored in the speech
response memory 4.
[0022] At the stage of recognizing a command word,
the command word uttered by the user is converted by
a microphone 10a, 10b to an electrical signal and
conveyed to the speech control unit 2. In the speech
control unit 2, the speech recognition means 3 converts
the uttered command word into feature vectors which are
stored in the speech response memory 4. The speech
control unit 2 additionally calculates, for each command
word in the vocabulary of the speech control unit 2, a
probability indicating how probable it is that the
command word uttered by the user is that command word
in the vocabulary. After this, the speech control unit 2
examines which command word in the vocabulary has the
greatest probability value, wherein this word is
preliminarily selected as a recognized command word.
The calculated probability for this word is further
compared with a probability produced by the background
noise pattern, to determine a confidence value. This
confidence value is compared by the speech control
unit 2 with the first threshold value Y stored in the
memory of the speech control unit, preferably a
read-only memory 6. If the comparison indicates that the
confidence value is greater than or equal to the first
threshold value Y, the speech control unit 2 deduces
that the command word uttered by the user was the word
with the greatest probability value. The speech control
unit 2 converts this command word into a corresponding
control signal, e.g. the signal for pressing the button
corresponding to the command word, which is transmitted
via the interface 9 to the control block 16 of the
wireless communication device, or the like. The control
block 16 interprets the command and executes it as
required, in a manner known per se. For example, the
user can enter a desired telephone number by saying it
or by pressing the corresponding buttons.
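
The first recognition stage of paragraph [0022] can be sketched roughly as follows. The description only states that the probability of the best command word is compared with the probability produced by the background noise pattern; forming the confidence value as a log-likelihood ratio is an assumption made here for illustration, and the names `confidence_value`, `first_stage`, `word_probs`, `noise_prob` and `threshold_y` are hypothetical.

```python
import math


def confidence_value(word_probs, noise_prob):
    """Pick the most probable command word and score it against noise.

    word_probs: dict mapping each command word to its probability.
    noise_prob: probability produced by the background noise pattern.
    The log-ratio form of the confidence value is an assumption; the
    source does not specify how the two probabilities are combined.
    """
    best_word = max(word_probs, key=word_probs.get)
    confidence = math.log(word_probs[best_word]) - math.log(noise_prob)
    return best_word, confidence


def first_stage(word_probs, noise_prob, threshold_y):
    """Return (word, confidence, accepted) for the first recognition stage."""
    best_word, confidence = confidence_value(word_probs, noise_prob)
    # Accept when the confidence value is greater than or equal to Y.
    accepted = confidence >= threshold_y
    return best_word, confidence, accepted
```

With this form, a command word that is only marginally more probable than the background noise yields a confidence value close to zero and is not accepted at the first stage.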

[0023] In case the comparison above indicated that
the confidence value is smaller than the first threshold
value Y, a second comparison is made to the second
threshold value A. If the comparison indicates that the
confidence value is greater than or equal to the second
threshold value A, the speech control unit 2 extends the
time limit determined for recognizing the command word
and waits to see whether the user will utter the command
word again. If the speech control unit 2 detects that the
user utters a speech command within said time limit, the
speech control unit takes all the measures presented
above in the description of the advantageous embodiment
of the invention, i.e. forming the feature vectors and
storing them in the speech response memory 4, and
calculating the probabilities and a new confidence value.
Next, a new comparison is made between the confidence
value and the first threshold value Y. If the confidence
value is greater than or equal to the first threshold
value Y, the speech control unit 2 interprets that the
speech command was recognized correctly, wherein the
speech control unit 2 converts the speech command into
the corresponding control signal and transmits it to the
control block 16. However, if the confidence value is
smaller than the first threshold value Y, the speech
control unit 2 makes a comparison between the feature
vectors of the first and second words uttered and stored
in the speech response memory 4. This comparison involves
first time warping of the feature vectors and then
calculating the distance between the words, as described
above in connection with the description of the method.
On the basis of these calculated distances, the speech
control unit 2 deduces whether the words uttered by the
user were the same or different words. If they were
different words, the speech control unit 2 did not
recognize the command word and neither will it form a
control signal. If the command words were probably the
same, the speech control unit 2 converts the command
word to the corresponding control signal.
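
The decision flow of paragraphs [0022] and [0023] can be summarized in a short sketch. It assumes that the confidence values and the warped distance between the two utterances have already been computed elsewhere. The function name `decide`, its parameters, and the use of a distance limit `max_distance` to judge whether the two words are "probably the same" are hypothetical choices for illustration, not details given in the description.

```python
def decide(first, second, distance, threshold_y, threshold_a, max_distance):
    """Return the accepted command word, or None if nothing is accepted.

    first:    (word, confidence) from the first recognition stage.
    second:   (word, confidence) from the second stage, or None if the
              user did not repeat the command within the extended limit.
    distance: distance between the time-warped feature vectors of the
              two utterances, or None if it was not computed.
    max_distance is an assumed similarity limit, not from the source.
    """
    word1, conf1 = first
    if conf1 >= threshold_y:
        return word1                 # first stage is reliable enough
    if conf1 < threshold_a:
        return None                  # too unreliable to wait for a repeat
    if second is None:
        return None                  # no repetition within the time limit
    word2, conf2 = second
    if conf2 >= threshold_y:
        return word2                 # second stage is reliable enough
    if distance is not None and distance <= max_distance:
        return word2                 # both utterances were probably the same
    return None                      # different words: no control signal
```

For example, `decide(("call", 0.5), ("call", 0.6), 3.0, 1.0, 0.2, 5.0)` accepts the word even though neither confidence value reaches Y, because the two utterances lie within the assumed distance limit.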

[0024] In case the first command word was not
recognized with sufficient confidence and the user does not
repeat the command word within the time limit, the
command word is not accepted. In this case too, no control
signal is transmitted to the control block 16.
[0025] To increase the convenience of use of the
wireless communication device 1 in those cases where the
first recognition of the command word did not provide a
sufficiently reliable recognition, the user can be
informed of the failure of the recognition of the first
stage and be requested to utter the command word again.
The wireless communication device 1 forms e.g. an audio
message with a speech synthesizer 8 and/or a visual
message on a display means 13. The wireless
communication device 1 can inform the user with an audio
and/or visual signal also in a situation where the
recognition was successful. Thus it will not remain
obscure to the user whether the recognition was
successful or not. This is particularly useful under
noisy use conditions.
[0026] Warping of feature vectors and calculating of
distances between words are prior art known per se, which
is why these are not disclosed here in more detail. It is
obvious that the present invention is not limited solely
to the embodiments presented above but it can be modified
within the scope of the appended claims.
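
For orientation only, the time warping and distance calculation referred to in paragraph [0026] can be illustrated with classic dynamic time warping. This is a generic textbook sketch and not necessarily the variant used in the device; the function name `dtw_distance` and the choice of the Euclidean distance between feature vectors are assumptions.

```python
import math


def dtw_distance(a, b):
    """Dynamic time warping distance between two feature-vector sequences.

    a, b: sequences of equal-dimension feature vectors (tuples of floats).
    The local cost is the Euclidean distance (an assumed choice).
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] is the best accumulated cost aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(a[i - 1], b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances while b repeats a frame
                cost[i][j - 1],      # b advances while a repeats a frame
                cost[i - 1][j - 1],  # both sequences advance together
            )
    return cost[n][m]
```

Two utterances of the same word spoken at different speeds give a small accumulated distance, because the warping lets one sequence dwell on a frame while the other advances.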



Claims 

1. A method for recognizing speech commands by
using a time window, which is extendable when needed,
in which method a group of command words
selectable by speech commands are defined, a time
window is defined, within which the recognition of
the speech command is performed, and a first
recognition stage is performed, in which the recognition
result of the first recognition stage is selected,
characterized in that further in the method:

a) a first confidence value is determined for the
recognition result of the first recognition stage,

b) a first threshold value (Y) is determined,

c) said first confidence value is compared with
said first threshold value (Y),

d) if said first confidence value is greater than
or equal to said first threshold value (Y), the
recognition result of the first recognition stage is
selected as the recognition result of the speech
command,

e) if said first confidence value is smaller than
said first threshold value (Y), a second
recognition stage is performed for the speech
command, wherein

f) said time window is extended, and

g) a second confidence value is determined for
the recognition result of the second recognition
stage,

i) said second confidence value is compared
with said threshold value (Y),



j) if said second confidence value is greater
than or equal to said first threshold value (Y),
the command word selected at the second
stage is selected as the recognition result for
the speech command,

k) if said second confidence value is smaller
than said first threshold value (Y), a comparison
stage is performed, wherein

l) the first and second recognition results are
compared to find out at which probability they
are substantially the same, wherein if the
probability exceeds a predetermined value, the
command word selected at the second stage is
selected as the recognition result for the
speech command.

2. The method according to claim 1, characterized in
that at said recognition stages, a probability is
determined for one or several said command words,
at which the speech command uttered by the user
corresponds to said command word, wherein the
command word with the greatest determined probability
is selected as the recognition result of said
recognition stages.

3. The method according to claim 1 or 2,
characterized in that in the method an additional second
threshold value (A) is determined, wherein the stages
e) to k) are performed only if said first confidence
value is greater than said second threshold value
(A).

4. The method according to claim 3, characterized in
that the comparison stage k) is performed only if
said second confidence value is greater than said
second threshold value (A).

5. The method according to any of the claims 1 to 4,
characterized in that for determining the first
confidence value, a probability is determined for the first
speech command being background noise, wherein
the first confidence value is formed on the basis of
the probability determined for the command word
selected as the recognition result of the first
recognition stage, and the background noise probability.

6. The method according to any of the claims 1 to 5,
characterized in that for determining the second
confidence value, a probability is determined for the
second speech command being background noise,
wherein the second confidence value is formed on
the basis of the probability determined for the
command word selected as the recognition result of the
second recognition stage, and the background
noise probability.

7. A speech recognition device, in which a vocabulary
of selectable command words is defined, the device
comprising means (5) for measuring the time used
for recognition and comparing it with a predetermined
time window, and means (3, 4, 5) for selecting
a first recognition result, characterized in that
the speech recognition device comprises further:

means (3, 5) for calculating a first confidence
value for said first recognition result,
means (5) for comparing said first confidence
value with a predetermined first threshold value
(Y), wherein the recognition result of the first
recognition stage is arranged to be selected as
the recognition result of the speech command,
if said first confidence value is greater than or
equal to said first threshold value (Y), and
means (5) for performing the recognition stage
of a second speech command, if said first
confidence value is smaller than said first threshold
value (Y), the means for performing the
recognition stage of the second speech command
comprising:

means (5) for extending said time window,
means (3, 4, 5) for selecting the recognition
result of the second recognition stage,
means (5) for calculating a second confidence
value for said second recognition result,
means (5) for comparing said second
confidence value with the predetermined first
threshold value (Y), wherein the recognition
result of the second recognition stage is arranged
to be selected as the recognition result of the
speech command, if said second confidence
value is greater than or equal to said first
threshold value (Y), and

means (3, 4, 5) for performing a comparison
stage, the comparison stage being arranged to
be performed if said second confidence value
is smaller than said first threshold value (Y).

8. The speech recognition device according to claim
7, characterized in that the means for performing
the comparison stage comprise means (3, 4, 5) for
comparing the first and second recognition results.

9. A wireless communication device comprising
means for recognizing speech commands, in which
a vocabulary of selectable command words is
defined, the means for recognizing speech commands
comprising means (5) for measuring the time used
for recognition and comparing it with a predetermined
time window, and means (3, 4, 5) for selecting
a first recognition result, characterized in that
the means for recognizing speech commands
comprise further:

means (3, 5) for calculating a first confidence
value for said first recognition result,
means (5) for comparing said first confidence
value with a predetermined first threshold value
(Y), wherein the recognition result of the first
recognition stage is arranged to be selected as
the recognition result of the speech command,
if said first confidence value is greater than or
equal to said first threshold value (Y), and
means (5) for performing the recognition stage
of a second speech command, if said first
confidence value is smaller than said first threshold
value (Y), the means for performing the
recognition stage of the second speech command
comprising:

means (5) for extending said time window,
means (3, 4, 5) for selecting the recognition
result of the second recognition stage,
means (5) for calculating a second confidence
value for said second recognition result,
means (5) for comparing said second
confidence value with the predetermined first
threshold value (Y), wherein the recognition
result of the second recognition stage is arranged
to be selected as the recognition result of the
speech command, if said second confidence
value is greater than or equal to said first
threshold value (Y), and
means (3, 4, 5) for performing a comparison
stage, the comparison stage being arranged to
be performed if said second confidence value
is smaller than said first threshold value (Y).